TF-IDF Embedding
  • Compute the TF-IDF embeddings for a given sentence.
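    • TF-IDF Embedding: The TfidfVectorizer computes a TF-IDF vector for the sentence: an array with one weight per distinct term. Because the vectorizer is fitted on this single sentence, every term gets the same IDF, so the weights reflect term frequency only.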
  • Store the embeddings in a vector database using FAISS.
    • FAISS Storage: FAISS stores these embeddings in an index, allowing efficient similarity searches.
  • Display the embedded data in a simple plot to show the embeddings and their index positions.
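    • Retrieval and Plotting: The stored vector is visualized by plotting each TF-IDF weight against its feature index. The tfidf_faiss_visualization function also searches the index for the nearest neighbor of the stored vector and shows the resulting distance in a table alongside each term and weight.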
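The weights the vectorizer produces for a single sentence can be reproduced by hand. The following is a minimal sketch, assuming scikit-learn's default settings (lowercased tokens of at least two word characters, smoothed IDF, L2 normalization). The helper `single_doc_tfidf` and the toy sentence are hypothetical and exist only for illustration; they are not part of scikit-learn.

```python
import math
import re
from collections import Counter

def single_doc_tfidf(text):
    """Hand-computed TF-IDF weights for a corpus that contains one document."""
    # Lowercase and keep tokens of two or more word characters,
    # mirroring TfidfVectorizer's default token pattern.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    counts = Counter(tokens)

    n_docs = 1
    weights = {}
    for term, tf in counts.items():
        df = 1  # every term appears in the only document
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF, equals 1.0 here
        weights[term] = tf * idf

    # L2-normalize so the resulting vector has unit length.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in sorted(weights.items())}

weights = single_doc_tfidf("the bank and the fund")
for term, w in weights.items():
    print(f"{term:5s} {w:.4f}")
```

Here "the" occurs twice and the other three terms once, so the weights are 2/sqrt(7) and 1/sqrt(7). With only one document the IDF factor is 1.0 for every term, which is why the weights in this notebook are simply term counts divided by the length of the count vector.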
In [ ]:
%pip install -q faiss-cpu langchain matplotlib
Note: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
petastorm 0.12.1 requires pyspark>=2.1.0, which is not installed.
databricks-feature-store 0.14.3 requires pyspark<4,>=3.1.2, which is not installed.
ydata-profiling 4.2.0 requires numpy<1.24,>=1.16.0, but you have numpy 1.26.4 which is incompatible.
scipy 1.9.1 requires numpy<1.25.0,>=1.18.5, but you have numpy 1.26.4 which is incompatible.
numba 0.55.1 requires numpy<1.22,>=1.18, but you have numpy 1.26.4 which is incompatible.
mleap 0.20.0 requires scikit-learn<0.23.0,>=0.22.0, but you have scikit-learn 1.1.1 which is incompatible.
In [ ]:
import faiss
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_faiss_visualization(sentence):
    # Step 1: Generate TF-IDF embeddings
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform([sentence])
    # FAISS requires a C-contiguous float32 array, so convert once and reuse it
    tfidf_array = np.ascontiguousarray(tfidf_matrix.toarray(), dtype='float32')
    feature_names = vectorizer.get_feature_names_out()  # Terms corresponding to each feature

    # Step 2: Initialize FAISS index
    dimension = tfidf_array.shape[1]
    index = faiss.IndexFlatL2(dimension)  # Exact L2 search; alternative: faiss.IndexFlatIP(dimension)
    # normalize_L2 works in place, so it must be given the array that is stored.
    # TfidfVectorizer already L2-normalizes by default, so this is a safeguard.
    faiss.normalize_L2(tfidf_array)

    # Step 3: Add embeddings to FAISS
    index.add(tfidf_array)

    # Step 4: Retrieve the nearest neighbor for each vector (self-similarity)
    # The index holds a single vector, so the search returns index 0 at distance 0.0
    distances, indices = index.search(tfidf_array, k=1)

    # Step 5: Visualize the stored embedding as one bar per feature
    plt.figure(figsize=(9, 5))
    plt.bar(range(dimension), tfidf_array[0], color='skyblue')
    plt.xlabel("TF-IDF Feature Index")
    plt.ylabel("TF-IDF Weight")
    plt.title("TF-IDF Embedding Weights for Each Feature in the Sentence")
    plt.show()

    # Step 6: Display embedded data in a table with indices, distances, and terms
    table_data = {
        "Index": list(range(dimension)),
        "TF-IDF Weight": tfidf_array[0],
        # One distance per stored vector; repeated so every row of the table shows it
        "Nearest Neighbor Distance (IndexFlatL2)": [distances[0][0]] * dimension,
        "Term": feature_names
    }
    table_df = pd.DataFrame(table_data)

    # Print the table
    print("\nTF-IDF Embedding Table with Index, Term, TF-IDF Weight, and Nearest Neighbor Distance\n")
    display(table_df)

# Test the function with an example sentence
tfidf_faiss_visualization("""The financial services sector has experienced robust growth due to the adoption of digital banking and financial technology.
Our firm's investment banking division saw a revenue increase of 20% year-over-year, driven by higher client acquisition and new advisory services.
For the fiscal year ending in 2023, the company reported revenue of $30 million with an EBITDA margin of 30%.
The debt-to-equity ratio remains low at 0.4, providing a stable foundation for future investments and expansion in asset management.
Additionally, the company's market share in wealth management grew by 7%, attributed to expanded service offerings and improved client retention.""")
[Figure: bar chart "TF-IDF Embedding Weights for Each Feature in the Sentence"; x-axis: TF-IDF Feature Index, y-axis: TF-IDF Weight]
TF-IDF Embedding Table with Index, Term, TF-IDF Weight, and Nearest Neighbor Distance

Index  TF-IDF Weight        Nearest Neighbor Distance (IndexFlatL2)  Term
0      0.07254762501100116  0.0                                      20
1      0.07254762501100116  0.0                                      2023
2      0.14509525002200233  0.0                                      30
3      0.07254762501100116  0.0                                      acquisition
4      0.07254762501100116  0.0                                      additionally
5      0.07254762501100116  0.0                                      adoption
6      0.07254762501100116  0.0                                      advisory
7      0.07254762501100116  0.0                                      an
8      0.29019050004400465  0.0                                      and
9      0.07254762501100116  0.0                                      asset
10     0.07254762501100116  0.0                                      at
11     0.07254762501100116  0.0                                      attributed
12     0.14509525002200233  0.0                                      banking
13     0.14509525002200233  0.0                                      by
14     0.14509525002200233  0.0                                      client
15     0.14509525002200233  0.0                                      company
16     0.07254762501100116  0.0                                      debt
17     0.07254762501100116  0.0                                      digital
18     0.07254762501100116  0.0                                      division
19     0.07254762501100116  0.0                                      driven
20     0.07254762501100116  0.0                                      due
21     0.07254762501100116  0.0                                      ebitda
22     0.07254762501100116  0.0                                      ending
23     0.07254762501100116  0.0                                      equity
24     0.07254762501100116  0.0                                      expanded
25     0.07254762501100116  0.0                                      expansion
26     0.07254762501100116  0.0                                      experienced
27     0.14509525002200233  0.0                                      financial
28     0.07254762501100116  0.0                                      firm
29     0.07254762501100116  0.0                                      fiscal
30     0.14509525002200233  0.0                                      for
31     0.07254762501100116  0.0                                      foundation
32     0.07254762501100116  0.0                                      future
33     0.07254762501100116  0.0                                      grew
34     0.07254762501100116  0.0                                      growth
35     0.07254762501100116  0.0                                      has
36     0.07254762501100116  0.0                                      higher
37     0.07254762501100116  0.0                                      improved
38     0.2176428750330035   0.0                                      in
39     0.07254762501100116  0.0                                      increase
40     0.07254762501100116  0.0                                      investment
41     0.07254762501100116  0.0                                      investments
42     0.07254762501100116  0.0                                      low
43     0.14509525002200233  0.0                                      management
44     0.07254762501100116  0.0                                      margin
45     0.07254762501100116  0.0                                      market
46     0.07254762501100116  0.0                                      million
47     0.07254762501100116  0.0                                      new
48     0.29019050004400465  0.0                                      of
49     0.07254762501100116  0.0                                      offerings
50     0.07254762501100116  0.0                                      our
51     0.07254762501100116  0.0                                      over
52     0.07254762501100116  0.0                                      providing
53     0.07254762501100116  0.0                                      ratio
54     0.07254762501100116  0.0                                      remains
55     0.07254762501100116  0.0                                      reported
56     0.07254762501100116  0.0                                      retention
57     0.14509525002200233  0.0                                      revenue
58     0.07254762501100116  0.0                                      robust
59     0.07254762501100116  0.0                                      saw
60     0.07254762501100116  0.0                                      sector
61     0.07254762501100116  0.0                                      service
62     0.14509525002200233  0.0                                      services
63     0.07254762501100116  0.0                                      share
64     0.07254762501100116  0.0                                      stable
65     0.07254762501100116  0.0                                      technology
66     0.435285750066007    0.0                                      the
67     0.2176428750330035   0.0                                      to
68     0.07254762501100116  0.0                                      wealth
69     0.07254762501100116  0.0                                      with
70     0.2176428750330035   0.0                                      year
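The nearest neighbor distance is 0.0 in every row because the index holds a single vector and that same vector is used as the query. The sketch below uses plain Python rather than FAISS to illustrate what an exhaustive L2 index computes, which for `IndexFlatL2` is the squared Euclidean distance to every stored vector. The helpers `normalize` and `flat_l2_search` and the toy vectors are hypothetical stand-ins, not FAISS functions or notebook data.

```python
import math

def normalize(vec):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def flat_l2_search(stored, query, k=1):
    """Exhaustive search: squared L2 distance from the query to every stored vector."""
    scored = []
    for idx, vec in enumerate(stored):
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, idx))
    scored.sort()  # smallest distance first
    return scored[:k]

# Three toy term-count vectors (hypothetical data), normalized to unit length.
stored = [normalize(v) for v in ([2, 1, 0], [0, 1, 2], [1, 1, 1])]

# Searching with a stored vector returns that same vector at distance 0.0.
self_hit = flat_l2_search(stored, stored[0], k=1)
print(self_hit)  # [(0.0, 0)]

# For unit vectors, squared L2 distance equals 2 - 2 * (inner product),
# so L2 and inner-product search rank neighbors in the same order.
query = normalize([3, 1, 0])
results = flat_l2_search(stored, query, k=3)
for dist, idx in results:
    dot = sum(a * b for a, b in zip(stored[idx], query))
    print(idx, round(dist, 6), round(2 - 2 * dot, 6))
```

A search becomes informative only once the index contains several vectors, for example one per sentence, all produced by the same fitted vectorizer so that they share one feature space.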